log_loss (cross-entropy / negative log-likelihood)#

log_loss measures how well predicted probabilities match the true labels. It is the standard objective for logistic regression / softmax classifiers, and a common evaluation metric for probabilistic models.

Learning goals#

  • understand the binary and multiclass formulas (with notation)

  • build intuition for why confident mistakes are punished heavily

  • implement numerically-stable log loss in NumPy (from probabilities and from logits)

  • see how minimizing log loss trains logistic regression via gradient descent

  • know when log loss is the right metric (and when it is not)

Quick import#

from sklearn.metrics import log_loss

Table of contents#

  1. Definitions and notation

  2. Intuition (plots)

  3. NumPy implementation (binary + multiclass)

  4. Using log loss to optimize logistic regression

  5. Pros, cons, pitfalls

import numpy as np

import plotly.graph_objects as go
import os
import plotly.io as pio
from plotly.subplots import make_subplots

from sklearn.datasets import make_blobs
from sklearn.metrics import log_loss as sk_log_loss
from sklearn.model_selection import train_test_split

pio.templates.default = 'plotly_white'
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")

np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(0)

1) Definitions and notation#

Assume we have \(n\) examples.

Binary classification#

  • True label: \(y_i \in \{0,1\}\)

  • Predicted probability of the positive class: \(p_i = P(y_i=1 \mid x_i)\)

Per-example log loss (Bernoulli negative log-likelihood) is:

\[ \ell_i = -\Big(y_i \log(p_i) + (1-y_i)\log(1-p_i)\Big) \]

Average (optionally weighted) log loss:

\[ L = \frac{1}{n}\sum_{i=1}^n \ell_i \]

Multiclass classification (\(K\) classes)#

  • True label: \(y_i \in \{0,1,\dots,K-1\}\)

  • Predicted probabilities: \(p_{ik} = P(y_i=k \mid x_i)\) with \(\sum_{k=0}^{K-1} p_{ik}=1\)

If we write one-hot targets \(y_{ik} \in \{0,1\}\), then:

\[ L = -\frac{1}{n}\sum_{i=1}^n \sum_{k=0}^{K-1} y_{ik}\log(p_{ik}) \]

Equivalently (using integer labels):

\[ L = -\frac{1}{n}\sum_{i=1}^n \log\big(p_{i, y_i}\big) \]
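Both forms are easy to check numerically. A minimal sketch (the probability matrix `P_ex` and labels `y_ex` below are illustrative):

```python
import numpy as np

# illustrative probabilities (rows sum to 1) and integer labels
P_ex = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.2, 0.7]])
y_ex = np.array([0, 2])

# one-hot form: -mean over i of sum_k y_ik * log(p_ik)
Y_onehot = np.eye(3)[y_ex]
loss_onehot = -np.mean(np.sum(Y_onehot * np.log(P_ex), axis=1))

# integer-label form: gather the probability assigned to the true class
loss_gather = -np.mean(np.log(P_ex[np.arange(len(y_ex)), y_ex]))

print(loss_onehot, loss_gather)  # identical: both -log(0.7)
```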

Why the log?#

If a model assigns probability \(p_{i,y_i}\) to the true class, the likelihood of the dataset is:

\[ \prod_{i=1}^n p_{i,y_i} \]

Taking the negative log turns the product into a sum:

\[ -\log\Big(\prod_{i=1}^n p_{i,y_i}\Big) = -\sum_{i=1}^n \log(p_{i,y_i}) \]

So minimizing log loss is the same as maximizing the likelihood.
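Working in log space is also what makes the likelihood computable at all: a product of many probabilities underflows to zero in floating point, while the sum of logs stays finite. A quick demo (the probabilities here are made up):

```python
import numpy as np

rng_demo = np.random.default_rng(7)  # hypothetical per-sample probabilities
p_true = rng_demo.uniform(0.1, 0.9, size=5000)

likelihood = np.prod(p_true)   # underflows to exactly 0.0 at this many samples
nll = -np.sum(np.log(p_true))  # same information, no underflow

print(likelihood)  # 0.0
print(nll)         # a finite positive number
```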

From logits (numerical stability)#

Sometimes models output logits (real-valued scores) instead of probabilities.

Binary: logit \(z_i = w^\top x_i + b\), \(p_i = \sigma(z_i)\) where \(\sigma\) is the sigmoid.

A stable per-sample loss is:

\[ \ell_i = \operatorname{softplus}(z_i) - y_i z_i,\quad \operatorname{softplus}(z)=\log(1+e^z) \]

Multiclass: logits \(z_{ik}\), softmax probabilities \(p_{ik} = \frac{e^{z_{ik}}}{\sum_j e^{z_{ij}}}\).

Stable loss:

\[ \ell_i = \log\Big(\sum_{k=0}^{K-1} e^{z_{ik}}\Big) - z_{i,y_i} \]

In practice we also clip probabilities with a small \(\varepsilon\) to avoid \(\log(0)\) (which would be \(+\infty\)).
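To see why the logit form matters, compare the naive pipeline (probabilities first, then logs) against the softplus form on extreme logits; `np.logaddexp(0, z)` is a numerically stable softplus. This is a sketch with the true label fixed to \(y=1\):

```python
import numpy as np

z = np.array([-800.0, 0.0, 800.0])  # extreme logits; true label y = 1 throughout

# naive: form p = sigmoid(z), then take -log(p); breaks for very negative z
with np.errstate(over='ignore', divide='ignore'):
    p = 1.0 / (1.0 + np.exp(-z))  # exp(800) overflows, so p underflows to 0.0
    naive = -np.log(p)

# stable: softplus(z) - y*z with y = 1, via logaddexp
stable = np.logaddexp(0.0, z) - z

print(naive)   # [ inf  0.6931  0.    ]
print(stable)  # [ 800. 0.6931  0.    ]
```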

def sigmoid(z):
    z = np.asarray(z, dtype=float)
    return np.where(z >= 0, 1.0 / (1.0 + np.exp(-z)), np.exp(z) / (1.0 + np.exp(z)))


def softplus(z):
    z = np.asarray(z, dtype=float)
    return np.logaddexp(0.0, z)


def logsumexp(a, axis=None, keepdims=False):
    a = np.asarray(a, dtype=float)
    a_max = np.max(a, axis=axis, keepdims=True)
    out = np.log(np.sum(np.exp(a - a_max), axis=axis, keepdims=True)) + a_max
    if keepdims:
        return out
    if axis is None:
        return out.squeeze()
    return np.squeeze(out, axis=axis)


def log_softmax(z, axis=1):
    z = np.asarray(z, dtype=float)
    return z - logsumexp(z, axis=axis, keepdims=True)


def _weighted_mean(values, sample_weight=None):
    values = np.asarray(values, dtype=float)
    if sample_weight is None:
        return float(np.mean(values))

    w = np.asarray(sample_weight, dtype=float)
    if w.shape != values.shape:
        raise ValueError('sample_weight must have the same shape as values')
    w_sum = np.sum(w)
    if w_sum <= 0:
        raise ValueError('sample_weight must sum to a positive number')
    return float(np.sum(w * values) / w_sum)


def log_loss_binary(y_true, y_prob, *, eps=1e-15, sample_weight=None):
    """Binary log loss from probabilities.

    Parameters
    - y_true: shape (n,), values in {0,1}
    - y_prob: shape (n,), predicted P(y=1|x)
    """
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob, dtype=float)

    if y_true.shape != y_prob.shape:
        raise ValueError('y_true and y_prob must have the same shape')

    if np.any((y_true != 0) & (y_true != 1)):
        raise ValueError('y_true must contain only 0/1 labels')

    p = np.clip(y_prob, eps, 1.0 - eps)
    losses = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    return _weighted_mean(losses, sample_weight=sample_weight)


def log_loss_multiclass(y_true, y_prob, *, eps=1e-15, sample_weight=None):
    """Multiclass log loss from probabilities.

    Parameters
    - y_true: shape (n,), integer labels in {0,1,...,K-1}
    - y_prob: shape (n,K), predicted class probabilities (rows should sum to 1)
    """
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob, dtype=float)

    if y_prob.ndim != 2:
        raise ValueError('y_prob must be a 2D array of shape (n_samples, n_classes)')

    n_samples, n_classes = y_prob.shape
    if y_true.shape != (n_samples,):
        raise ValueError('y_true must have shape (n_samples,)')

    if np.any((y_true < 0) | (y_true >= n_classes)):
        raise ValueError('y_true contains labels outside [0, n_classes)')

    p = np.clip(y_prob, eps, 1.0 - eps)
    p = p / p.sum(axis=1, keepdims=True)
    losses = -np.log(p[np.arange(n_samples), y_true])
    return _weighted_mean(losses, sample_weight=sample_weight)


def log_loss_binary_from_logits(y_true, logits, *, sample_weight=None):
    """Binary log loss from logits: softplus(z) - y*z."""
    y_true = np.asarray(y_true)
    logits = np.asarray(logits, dtype=float)

    if y_true.shape != logits.shape:
        raise ValueError('y_true and logits must have the same shape')
    if np.any((y_true != 0) & (y_true != 1)):
        raise ValueError('y_true must contain only 0/1 labels')

    losses = softplus(logits) - y_true * logits
    return _weighted_mean(losses, sample_weight=sample_weight)


def log_loss_multiclass_from_logits(y_true, logits, *, sample_weight=None):
    """Multiclass log loss from logits: -log softmax(true_class)."""
    y_true = np.asarray(y_true)
    logits = np.asarray(logits, dtype=float)

    if logits.ndim != 2:
        raise ValueError('logits must be a 2D array of shape (n_samples, n_classes)')

    n_samples, n_classes = logits.shape
    if y_true.shape != (n_samples,):
        raise ValueError('y_true must have shape (n_samples,)')
    if np.any((y_true < 0) | (y_true >= n_classes)):
        raise ValueError('y_true contains labels outside [0, n_classes)')

    log_probs = log_softmax(logits, axis=1)
    losses = -log_probs[np.arange(n_samples), y_true]
    return _weighted_mean(losses, sample_weight=sample_weight)

2) Intuition (plots)#

For binary classification:

  • if the true label is 1, the loss is \(-\log(p)\)

  • if the true label is 0, the loss is \(-\log(1-p)\)

So being confidently wrong is punished heavily: the loss grows without bound as the predicted probability of the true class approaches 0.

A key property: log loss is a strictly proper scoring rule. If the true label is Bernoulli with positive rate \(q\), then the expected loss of predicting \(p\) is:

\[ \mathbb{E}[\ell(p)] = -q\log(p) - (1-q)\log(1-p) \]

This is the cross-entropy \(H(q,p)\) and it is minimized at \(p=q\).
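The minimizer can also be verified numerically on a fine grid (a quick sketch with \(q = 0.7\)):

```python
import numpy as np

q = 0.7
p_grid = np.linspace(1e-6, 1 - 1e-6, 200001)
expected = -(q * np.log(p_grid) + (1 - q) * np.log(1 - p_grid))

p_star = p_grid[np.argmin(expected)]
print(p_star)  # ≈ 0.7, i.e. the minimum sits at p = q
```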

eps = 1e-6
p = np.linspace(eps, 1 - eps, 800)

loss_y1 = -np.log(p)
loss_y0 = -np.log(1 - p)

q = 0.7
expected_loss = -(q * np.log(p) + (1 - q) * np.log(1 - p))

fig = make_subplots(
    rows=1,
    cols=2,
    subplot_titles=(
        'Per-sample log loss as a function of predicted probability',
        'Expected log loss when the true positive rate is q',
    ),
)

fig.add_trace(go.Scatter(x=p, y=loss_y1, name='y=1: -log(p)'), row=1, col=1)
fig.add_trace(go.Scatter(x=p, y=loss_y0, name='y=0: -log(1-p)'), row=1, col=1)
fig.update_xaxes(title_text='predicted probability p', row=1, col=1)
fig.update_yaxes(title_text='loss', row=1, col=1)

fig.add_trace(go.Scatter(x=p, y=expected_loss, name='E[loss]'), row=1, col=2)
fig.add_vline(x=q, line_width=2, line_dash='dash', line_color='black', row=1, col=2)
fig.add_annotation(
    x=q,
    y=float(expected_loss[np.argmin(np.abs(p - q))]),
    text='minimum at p=q',
    showarrow=True,
    arrowhead=2,
    ax=40,
    ay=-30,
    row=1,
    col=2,
)
fig.update_xaxes(title_text='predicted probability p', row=1, col=2)
fig.update_yaxes(title_text='expected loss', row=1, col=2)

fig.update_layout(height=420, legend=dict(orientation='h', yanchor='bottom', y=1.02))
fig.show()

3) NumPy implementation: quick sanity checks#

A small example showing why log loss is sensitive to a single confident mistake.

y_true = np.array([1, 1, 1, 0, 0, 0])

p_good = np.array([0.9, 0.8, 0.7, 0.3, 0.2, 0.1])
p_one_confident_mistake = np.array([0.9, 0.8, 0.01, 0.3, 0.2, 0.99])

print('mean log loss (good):', log_loss_binary(y_true, p_good))
print('mean log loss (one confident mistake):', log_loss_binary(y_true, p_one_confident_mistake))
print('sklearn check:', sk_log_loss(y_true, p_good))

eps = 1e-15
losses_good = -(y_true * np.log(np.clip(p_good, eps, 1 - eps)) + (1 - y_true) * np.log(1 - np.clip(p_good, eps, 1 - eps)))
losses_bad = -(y_true * np.log(np.clip(p_one_confident_mistake, eps, 1 - eps)) + (1 - y_true) * np.log(1 - np.clip(p_one_confident_mistake, eps, 1 - eps)))

fig = go.Figure()
fig.add_trace(go.Bar(x=np.arange(len(y_true)), y=losses_good, name='per-sample loss (good)'))
fig.add_trace(go.Bar(x=np.arange(len(y_true)), y=losses_bad, name='per-sample loss (one confident mistake)'))
fig.update_layout(
    barmode='group',
    title='A single confident mistake can dominate mean log loss',
    xaxis_title='sample index',
    yaxis_title='per-sample loss',
)
fig.show()

p_baseline = np.full_like(y_true, y_true.mean(), dtype=float)
print('baseline (predict base rate p=mean(y)):', log_loss_binary(y_true, p_baseline))

mean log loss (good): 0.22839300363692283
mean log loss (one confident mistake): 1.6864438223668599
sklearn check: 0.22839300363692283
baseline (predict base rate p=mean(y)): 0.6931471805599453
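By the softplus identity from section 1, the logit path and the probability path agree whenever the probabilities come from the same logits. A self-contained check with hypothetical random logits, mirroring what `log_loss_binary` and `log_loss_binary_from_logits` compute:

```python
import numpy as np

rng_chk = np.random.default_rng(1)  # hypothetical sample for the check
z = rng_chk.normal(scale=2.0, size=200)
y = rng_chk.integers(0, 2, size=200)

p = 1.0 / (1.0 + np.exp(-z))

# probability path: -(y*log(p) + (1-y)*log(1-p))
from_probs = np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p)))

# logit path: softplus(z) - y*z
from_logits = np.mean(np.logaddexp(0.0, z) - y * z)

print(np.isclose(from_probs, from_logits))  # True
```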

Multiclass example#

For multiclass problems you pass a probability matrix of shape (n_samples, n_classes). The loss for each sample is simply -log(probability_assigned_to_the_true_class).

y_true_mc = np.array([0, 2, 1, 2])
P = np.array(
    [
        [0.7, 0.2, 0.1],
        [0.1, 0.2, 0.7],
        [0.2, 0.6, 0.2],
        [0.05, 0.05, 0.9],
    ]
)

print('multiclass log loss (numpy):', log_loss_multiclass(y_true_mc, P))
print('multiclass log loss (sklearn):', sk_log_loss(y_true_mc, P, labels=[0, 1, 2]))

# Log probabilities are valid logits: softmax(log(P)) recovers P when rows sum to 1,
# so the logit-based loss should match the probability-based loss.
Z = np.log(P)
print('multiclass log loss from logits (numpy):', log_loss_multiclass_from_logits(y_true_mc, Z))

multiclass log loss (numpy): 0.33238400682532043
multiclass log loss (sklearn): 0.3323840068253205
multiclass log loss from logits (numpy): 0.33238400682532054

4) Using log loss to optimize logistic regression (NumPy)#

Binary logistic regression models:

\[ z_i = w^\top x_i + b,\quad p_i = \sigma(z_i) \]

and minimizes the average log loss:

\[ J(w,b) = \frac{1}{n}\sum_{i=1}^n \Big(-y_i\log(p_i) - (1-y_i)\log(1-p_i)\Big) \]

A very useful fact for optimization is that the derivative w.r.t. the logit is:

\[ \frac{\partial \ell_i}{\partial z_i} = p_i - y_i \]

So the gradients are:

\[ \nabla_w J = \frac{1}{n} X^\top (p - y),\quad \frac{\partial J}{\partial b} = \frac{1}{n}\sum_{i=1}^n (p_i - y_i) \]
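These expressions can be validated with a central finite-difference check. A sketch on hypothetical random data (the names `Xc`, `yc`, `wc`, `bc` are local to the check):

```python
import numpy as np

rng_fd = np.random.default_rng(42)  # hypothetical data for the check
Xc = rng_fd.normal(size=(50, 3))
yc = rng_fd.integers(0, 2, size=50)
wc = rng_fd.normal(size=3)
bc = 0.1

def J(wv, bv):
    pv = 1.0 / (1.0 + np.exp(-(Xc @ wv + bv)))
    return -np.mean(yc * np.log(pv) + (1 - yc) * np.log(1 - pv))

# analytic gradient: X^T (p - y) / n
pc = 1.0 / (1.0 + np.exp(-(Xc @ wc + bc)))
grad_w = Xc.T @ (pc - yc) / len(yc)

# central finite differences, one coordinate at a time
h = 1e-6
fd = np.array([(J(wc + h * e, bc) - J(wc - h * e, bc)) / (2 * h) for e in np.eye(3)])

print(np.allclose(grad_w, fd, atol=1e-5))  # True
```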

Below is a simple gradient descent optimizer that learns w and b by directly minimizing log loss.

def standardize_fit_transform(X):
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std = np.where(std == 0, 1.0, std)
    return (X - mean) / std, mean, std


def standardize_transform(X, mean, std):
    X = np.asarray(X, dtype=float)
    std = np.where(std == 0, 1.0, std)
    return (X - mean) / std


def fit_logistic_regression_gd(
    X_train,
    y_train,
    X_val=None,
    y_val=None,
    *,
    lr=0.2,
    n_steps=300,
    l2=0.0,
):
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    n_samples, n_features = X_train.shape

    w = np.zeros(n_features)
    b = 0.0

    history = {
        'step': [],
        'train_loss': [],
        'train_acc': [],
        'val_loss': [],
        'val_acc': [],
    }

    for step in range(n_steps):
        logits = X_train @ w + b
        p = sigmoid(logits)

        loss = log_loss_binary(y_train, p)  # logged loss excludes the L2 penalty term
        grad_w = (X_train.T @ (p - y_train)) / n_samples + l2 * w
        grad_b = float(np.mean(p - y_train))

        w -= lr * grad_w
        b -= lr * grad_b

        pred = (p >= 0.5).astype(int)
        acc = float(np.mean(pred == y_train))

        history['step'].append(step)
        history['train_loss'].append(loss)
        history['train_acc'].append(acc)

        if X_val is not None and y_val is not None:
            logits_val = X_val @ w + b
            p_val = sigmoid(logits_val)
            history['val_loss'].append(log_loss_binary(y_val, p_val))
            history['val_acc'].append(float(np.mean((p_val >= 0.5).astype(int) == y_val)))

    return w, b, history

X, y = make_blobs(
    n_samples=800,
    centers=2,
    n_features=2,
    cluster_std=2.2,
    random_state=0,
)

X_train, X_val, y_train, y_val = train_test_split(
    X,
    y,
    test_size=0.3,
    random_state=0,
    stratify=y,
)

X_train_s, mean, std = standardize_fit_transform(X_train)
X_val_s = standardize_transform(X_val, mean, std)

w, b, hist = fit_logistic_regression_gd(
    X_train_s,
    y_train,
    X_val=X_val_s,
    y_val=y_val,
    lr=0.2,
    n_steps=250,
)

fig = make_subplots(specs=[[{'secondary_y': True}]])
fig.add_trace(go.Scatter(x=hist['step'], y=hist['train_loss'], name='train log loss'), secondary_y=False)
fig.add_trace(go.Scatter(x=hist['step'], y=hist['val_loss'], name='val log loss'), secondary_y=False)
fig.add_trace(go.Scatter(x=hist['step'], y=hist['train_acc'], name='train accuracy'), secondary_y=True)
fig.add_trace(go.Scatter(x=hist['step'], y=hist['val_acc'], name='val accuracy'), secondary_y=True)

fig.update_xaxes(title_text='gradient descent step')
fig.update_yaxes(title_text='log loss (lower is better)', secondary_y=False)
fig.update_yaxes(title_text='accuracy', range=[0, 1], secondary_y=True)
fig.update_layout(title='Minimizing log loss trains logistic regression', height=420)
fig.show()

print('final train loss:', hist['train_loss'][-1])
print('final val loss:', hist['val_loss'][-1])

final train loss: 0.45572277119982485
final val loss: 0.4442069329282946

x0_min, x0_max = X_train_s[:, 0].min() - 0.8, X_train_s[:, 0].max() + 0.8
x1_min, x1_max = X_train_s[:, 1].min() - 0.8, X_train_s[:, 1].max() + 0.8

x0 = np.linspace(x0_min, x0_max, 220)
x1 = np.linspace(x1_min, x1_max, 220)
xx0, xx1 = np.meshgrid(x0, x1)
grid = np.c_[xx0.ravel(), xx1.ravel()]

prob_grid = sigmoid(grid @ w + b).reshape(xx0.shape)

fig = go.Figure()
fig.add_trace(
    go.Contour(
        x=x0,
        y=x1,
        z=prob_grid,
        contours=dict(start=0.0, end=1.0, size=0.1),
        colorscale='RdBu',
        opacity=0.85,
        colorbar=dict(title='P(y=1)'),
        name='P(y=1)',
    )
)
fig.add_trace(
    go.Contour(
        x=x0,
        y=x1,
        z=prob_grid,
        contours=dict(start=0.5, end=0.5, size=0.5),
        showscale=False,
        line=dict(color='black', width=3),
        hoverinfo='skip',
        name='decision boundary (p=0.5)',
    )
)

fig.add_trace(
    go.Scatter(
        x=X_train_s[:, 0],
        y=X_train_s[:, 1],
        mode='markers',
        name='train',
        marker=dict(
            size=6,
            color=y_train,
            cmin=0,
            cmax=1,
            colorscale=[[0, '#1f77b4'], [1, '#d62728']],
            line=dict(width=0.5, color='black'),
        ),
    )
)

fig.add_trace(
    go.Scatter(
        x=X_val_s[:, 0],
        y=X_val_s[:, 1],
        mode='markers',
        name='val',
        marker=dict(
            size=8,
            symbol='x',
            color=y_val,
            cmin=0,
            cmax=1,
            colorscale=[[0, '#1f77b4'], [1, '#d62728']],
            line=dict(width=1.0, color='black'),
        ),
    )
)

fig.update_layout(
    title='Decision boundary after minimizing log loss (standardized feature space)',
    xaxis_title='feature 1 (standardized)',
    yaxis_title='feature 2 (standardized)',
    height=520,
)
fig.show()

5) Pros, cons, pitfalls#

Pros#

  • Uses probabilities: rewards calibrated predictions, not just correct hard labels.

  • Strictly proper scoring rule: in expectation, you minimize it by predicting the true conditional probabilities.

  • Differentiable: works naturally as a training objective (logistic regression, neural nets, softmax models).

  • Works for multiclass: via categorical cross-entropy.

Cons / caveats#

  • Harder to interpret than accuracy (units are nats if using natural logs).

  • Unbounded above: a few confidently wrong predictions can dominate the mean.

  • Sensitive to label noise: mislabeled points can produce very large losses if the model is confident.

  • Requires good probability estimates; models that only rank well (AUC) can still have poor log loss.

Common pitfalls#

  • Passing hard class labels instead of probabilities (log loss expects probabilities).

  • For multiclass, probability rows must align with label order and sum to 1.

  • With scikit-learn, if y_true contains only one class, pass labels=[...] to define the full label set.

  • Not clipping probabilities leads to log(0) and infinite loss; use a small \(\varepsilon\).
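For example, the single-class pitfall shows up like this with scikit-learn; passing `labels` pins down the full label set (a small sketch):

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 1, 1])        # only one class present
y_prob = np.array([0.9, 0.8, 0.7])  # predicted P(y=1)

# without labels=, sklearn cannot infer the label set and raises ValueError
try:
    log_loss(y_true, y_prob)
except ValueError as e:
    print('ValueError:', e)

print(log_loss(y_true, y_prob, labels=[0, 1]))  # ≈ 0.2284
```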

Where it is a good fit#

  • When you care about probability quality: risk estimation, triage systems, cost-sensitive decisions.

  • When you want an evaluation metric that matches the training objective for probabilistic classifiers.

  • When comparing calibrated models (often alongside calibration curves / Brier score).

Exercises#

  • Derive \(\partial \ell / \partial z = p - y\) for the binary case.

  • Implement multiclass gradient descent for softmax regression using log_loss_multiclass_from_logits.

  • Compare log loss and accuracy on an imbalanced dataset; notice that accuracy can look good even with poor probabilities.

References#

  • scikit-learn: sklearn.metrics.log_loss https://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html

  • Cross-entropy and negative log-likelihood (NLL): https://en.wikipedia.org/wiki/Cross_entropy

  • Proper scoring rules: https://en.wikipedia.org/wiki/Scoring_rule